import pandas
!pip install statsmodels
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/doggy-illness.csv
#Import the data from the .csv file
dataset = pandas.read_csv('doggy-illness.csv', delimiter="\t")
#Let's have a look at the data
dataset
| | male | attended_training | age | body_fat_percentage | core_temperature | ate_at_tonys_steakhouse | needed_intensive_care | protein_content_of_last_meal |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 6.9 | 38 | 38.423169 | 0 | 0 | 7.66 |
| 1 | 0 | 1 | 5.4 | 32 | 39.015998 | 0 | 0 | 13.36 |
| 2 | 1 | 1 | 5.4 | 12 | 39.148341 | 0 | 0 | 12.90 |
| 3 | 1 | 0 | 4.8 | 23 | 39.060049 | 0 | 0 | 13.45 |
| 4 | 1 | 0 | 4.8 | 15 | 38.655439 | 0 | 0 | 10.53 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 93 | 0 | 0 | 4.5 | 38 | 37.939942 | 0 | 0 | 7.35 |
| 94 | 1 | 0 | 1.8 | 11 | 38.790426 | 1 | 1 | 12.18 |
| 95 | 0 | 0 | 6.6 | 20 | 39.489962 | 0 | 0 | 15.84 |
| 96 | 0 | 0 | 6.9 | 32 | 38.575742 | 1 | 1 | 9.79 |
| 97 | 1 | 1 | 6.0 | 21 | 39.766447 | 1 | 1 | 21.30 |
98 rows × 8 columns
For this exercise, we'll try to predict core_temperature from some of the other available features.
Let's quickly eyeball which features seem to have some kind of relationship with core_temperature.
import graphing # Custom graphing code that uses Plotly. See our GitHub repository for details
graphing.box_and_whisker(dataset, "male", "core_temperature", show=True)
graphing.box_and_whisker(dataset, "attended_training", "core_temperature", show=True)
graphing.box_and_whisker(dataset, "ate_at_tonys_steakhouse", "core_temperature", show=True)
graphing.scatter_2D(dataset, "body_fat_percentage", "core_temperature", show=True)
graphing.scatter_2D(dataset, "protein_content_of_last_meal", "core_temperature", show=True)
graphing.scatter_2D(dataset, "age", "core_temperature")
At a glance, fatter, older, and male dogs seem to more commonly have higher temperatures than thinner, younger, or female dogs. Dogs who ate a lot of protein last night also seem to be more unwell. The other features don't seem particularly useful.
Let's try to predict core_temperature using simple linear regression, and note the R-Squared for these relationships.
import statsmodels.formula.api as smf
import graphing # custom graphing code. See our GitHub repo for details

for feature in ["male", "age", "protein_content_of_last_meal", "body_fat_percentage"]:
    # Perform linear regression. This method takes care of
    # the entire fitting procedure for us.
    formula = "core_temperature ~ " + feature
    simple_model = smf.ols(formula = formula, data = dataset).fit()

    print(feature)
    print("R-squared:", simple_model.rsquared)

    # Show a graph of the result
    graphing.scatter_2D(dataset, label_x=feature,
                                 label_y="core_temperature",
                                 title = feature,
                                 trendline=lambda x: simple_model.params[1] * x + simple_model.params[0],
                                 show=True)
male
R-squared: 0.09990074430719931
age
R-squared: 0.26481160813424653
protein_content_of_last_meal
R-squared: 0.9155158150005706
body_fat_percentage
R-squared: 0.00020809002637733887
Scrolling through these graphs, we get R-squared values of 0.0002 (body_fat_percentage), 0.1 (male), and 0.26 (age).
While protein_content_of_last_meal looks very promising too, the relationship looks curved, not linear. We'll leave this feature for now and come back to it in the next exercise.
We've shown the R-Squared value for these models and used it as a measure of "correctness" for our regression, but what is it?
Intuitively, we can think of R-Squared as a ratio of how much better our regression line is than a naive regression that just goes straight through the mean of all examples.
Roughly, R-Squared is calculated by taking the loss/error of the trained model and dividing it by the loss/error of the naive model. That gives a ratio where 0 is better and 1 is worse, so the whole thing is subtracted from 1 to flip those results.
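That calculation can be sketched in a few lines. The following is a minimal illustration using hypothetical toy data (not the dog dataset), comparing the squared error of a fitted line against the squared error of always predicting the mean:

```python
import numpy as np

# Hypothetical toy data with a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a simple least-squares line
slope, intercept = np.polyfit(x, y, 1)
predictions = slope * x + intercept

# Squared error of the trained model, and of the naive
# model that always predicts the mean of y
ss_model = np.sum((y - predictions) ** 2)
ss_naive = np.sum((y - y.mean()) ** 2)

# Subtract the ratio from 1 so that 1 is a perfect fit and
# 0 is no better than the naive model
r_squared = 1 - ss_model / ss_naive
print(r_squared)
```

Because this toy data is nearly linear, the fitted line's error is a tiny fraction of the naive model's, giving an R-Squared close to 1.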
In the following code, we once again show the scatter plot with age and core_temperature, but this time, we show two regression lines. The first is the naive line that just goes straight through the mean. This has an R-Squared of 0 (since it's no better than itself). An R-Squared of 1 would be a line that fit each training example perfectly. The second plot shows our trained regression line, and we once again see its R-Squared.
formula = "core_temperature ~ age"

# Fit the model twice: we keep one copy as trained, and
# overwrite the other's parameters to make it naive
age_trained_model = smf.ols(formula = formula, data = dataset).fit()
age_naive_model = smf.ols(formula = formula, data = dataset).fit()

# The naive model always predicts the mean, so set its
# intercept to the mean and its slope to zero
age_naive_model.params[0] = dataset['core_temperature'].mean()
age_naive_model.params[1] = 0

print("naive R-squared:", age_naive_model.rsquared)
print("trained R-squared:", age_trained_model.rsquared)
# Show a graph of the result
graphing.scatter_2D(dataset, label_x="age",
                             label_y="core_temperature",
                             title = "Naive model",
                             # the naive model always predicts the mean
                             trendline=lambda x: [dataset['core_temperature'].mean()] * len(x),
                             show=True)

# Show a graph of the result
graphing.scatter_2D(dataset, label_x="age",
                             label_y="core_temperature",
                             title = "Trained model",
                             trendline=lambda x: age_trained_model.params[1] * x + age_trained_model.params[0])
naive R-squared: 0.0
trained R-squared: 0.26481160813424653
Instead of modeling these features separately, let's try to combine them into a single model. Body fat didn't seem to be useful after all, so let's just use male and age as features.
model = smf.ols(formula = "core_temperature ~ age + male", data = dataset).fit()
print("R-squared:", model.rsquared)
R-squared: 0.31485126997680213
By using both features at the same time, we got a better result than any of the one-feature (univariate) models.
How can we visualize this, though? A simple linear regression is drawn in 2D. With an extra feature in play, we add one dimension and work in 3D.
import numpy as np

# Show a graph of the result
# this needs to be 3D, because we now have three variables in play: two features and one label

def predict(age, male):
    '''
    This converts given age and male values into a prediction from the model
    '''
    # to make a prediction with statsmodels, we need to provide a dataframe
    # so create a dataframe with just the age and male variables
    df = pandas.DataFrame(dict(age=[age], male=[male]))
    return model.predict(df)

# Create the surface graph
fig = graphing.surface(
    x_values=np.array([min(dataset.age), max(dataset.age)]),
    y_values=np.array([0, 1]),
    calc_z=predict,
    axis_title_x="Age",
    axis_title_y="Male",
    axis_title_z="Core temperature"
)

# Add our datapoints to it and display
fig.add_scatter3d(x=dataset.age, y=dataset.male, z=dataset.core_temperature, mode='markers')
fig.show()
The preceding graph is interactive. Try rotating it to see how the model (shown as a solid plane) would predict core temperature from different combinations of age and sex.
When we have more than two features, it becomes very difficult to visualize these models. We usually have to look at the parameters directly. Let's do that now. Statsmodels, one of the common machine learning and statistics libraries, provides a summary() method that provides information about our model.
# Print summary information
model.summary()
| Dep. Variable: | core_temperature | R-squared: | 0.315 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.300 |
| Method: | Least Squares | F-statistic: | 21.83 |
| Date: | Wed, 23 Aug 2023 | Prob (F-statistic): | 1.58e-08 |
| Time: | 13:07:17 | Log-Likelihood: | -85.295 |
| No. Observations: | 98 | AIC: | 176.6 |
| Df Residuals: | 95 | BIC: | 184.3 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 37.9793 | 0.135 | 282.094 | 0.000 | 37.712 | 38.247 |
| age | 0.1406 | 0.026 | 5.459 | 0.000 | 0.089 | 0.192 |
| male | 0.3182 | 0.121 | 2.634 | 0.010 | 0.078 | 0.558 |
| Omnibus: | 21.610 | Durbin-Watson: | 2.369 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 5.227 |
| Skew: | 0.121 | Prob(JB): | 0.0733 |
| Kurtosis: | 1.895 | Cond. No. | 12.9 |
If we look at the top right-hand corner, we can see our R-squared statistic that we printed out earlier.
Slightly down and to the left, we can also see information about the data we trained our model on. For example, we can see that we trained it on 98 observations (No. Observations).
Under this, we find information about our parameters, in a column called coef (which stands for coefficients, a synonym for parameters in machine learning). Here, we can see the intercept was about 38, meaning that the model predicts a core temperature of 38 for a dog with age=0 and male=0. Underneath this, we see the parameter for age is 0.14, meaning that for each additional year of age, the predicted temperature rises by 0.14 degrees Celsius. For male, we can see a parameter of 0.32, meaning that the model estimates male dogs (where male == 1) to have temperatures 0.32 degrees Celsius higher than female dogs (where male == 0).
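To make that reading of the table concrete, we can reproduce the model's predictions by hand from the coefficients in the summary. This is a minimal sketch: predict_temperature is a hypothetical helper, and the numbers are copied from the coef column above.

```python
# Coefficients copied from the summary table above
intercept = 37.9793
age_coef = 0.1406
male_coef = 0.3182

def predict_temperature(age, male):
    # A linear model's prediction is just a weighted sum of
    # the features plus the intercept
    return intercept + age_coef * age + male_coef * male

# A five-year-old male dog vs. a five-year-old female dog
print(predict_temperature(age=5, male=1))  # about 39.00
print(predict_temperature(age=5, male=0))  # about 38.68
```

Note that the two predictions differ by exactly the male coefficient, 0.32 degrees, matching the interpretation above.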
Although we don't have space here to go into detail, the P column is also very useful. This tells us how confident the model is about each parameter value. As a rule of thumb, if the p-value is less than 0.05, there is a good chance that the relationship is trustworthy. For example, here both age and male are less than 0.05, so we should feel confident using this model in the real world.
As a final exercise, let's do the same thing with our earlier simple linear-regression model, relating age to core_temperature. Read through the following table and see what you can make out from this model.
age_trained_model.summary()
| Dep. Variable: | core_temperature | R-squared: | 0.265 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.257 |
| Method: | Least Squares | F-statistic: | 34.58 |
| Date: | Wed, 23 Aug 2023 | Prob (F-statistic): | 5.94e-08 |
| Time: | 13:13:59 | Log-Likelihood: | -88.749 |
| No. Observations: | 98 | AIC: | 181.5 |
| Df Residuals: | 96 | BIC: | 186.7 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 38.0879 | 0.132 | 288.373 | 0.000 | 37.826 | 38.350 |
| age | 0.1533 | 0.026 | 5.880 | 0.000 | 0.102 | 0.205 |
| Omnibus: | 43.487 | Durbin-Watson: | 2.492 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 6.605 |
| Skew: | 0.087 | Prob(JB): | 0.0368 |
| Kurtosis: | 1.740 | Cond. No. | 11.3 |
We covered the following concepts in this exercise:

- Visually comparing individual features against the label to identify promising predictors
- Fitting simple (univariate) linear-regression models and comparing their R-Squared values
- Understanding R-Squared as an improvement over a naive model that always predicts the mean
- Combining two features into a single multiple-linear-regression model and visualizing it in 3D
- Inspecting a model's parameters and p-values with summary()